Abstract

Perfectly rational decision-makers maximize expected utility, but cru- 
cially ignore the resource costs incurred when determining optimal actions. 
Here we propose an information-theoretic formalization of bounded ratio- 
nal decision-making where decision-makers trade off expected utility and 
information processing costs. Such bounded rational decision-makers can 
be thought of as thermodynamic machines that undergo physical state 
changes when they compute. Their behavior is governed by a free en- 
ergy functional that trades off changes in internal energy — as a proxy for 
utility — and entropic changes representing computational costs induced 
by changing states. As a result, the bounded rational decision-making 
problem can be rephrased in terms of well-known concepts from statis- 
tical physics. In the limit when computational costs are ignored, the 
maximum expected utility principle is recovered. We discuss the relation 
to satisficing decision-making procedures as well as links to existing theo- 
retical frameworks and human decision-making experiments that describe 
deviations from expected utility theory. Since most of the mathematical 
machinery can be borrowed from statistical physics, the main contribution 
is to axiomatically derive and interpret the thermodynamic free energy as 
a model of bounded rational decision-making. 

1 Introduction 

In everyday life decision-makers often have to make fast and frugal choices 
[lljQ- Consider, for example, an antelope that quickly has to choose a direction 
of flight when faced with a predator. By the time an antelope had considered 
all possible flight paths to determine the optimal one, it would most probably 
be already eaten. In general, decision-makers seem to trade off the expected 
desirability of the consequences of an action against the effort and resources 
(time, money, food, computational effort, knowledge, opportunity costs, etc.) 
required for searching the optimum [^, . 

Classic theories of decision making generally ignore information-processing 
costs by assuming that decision makers always pick the option with maximum
return — irrespective of the effort or the resources it might take to find or com- 
pute the optimal action [1, 01 ■ Such decision- makers are described as perfectly 
rational. However, being perfectly rational seems to contradict our intuition of 
real-world decision-making, where information processing constraints play an 
important ro le llj . This has led to an abundant literature on bounded rational- 
ity d, 1^, [13, Unlike perfectly rational decision makers, bounded rational 
decision-makers are subject to information processing constraints, that is they 
may have limited time and speed to process a limited amount of information. 

1.1 Thermodynamic Intuition 
Figure 1: The molecule-in-a-box device. (a) Initially, the molecule moves
freely within a space of volume $V$ delimited by two pistons. The compartments
A and B correspond to the two logical states of the device. (b) Then, the lower
piston pushes the molecule into part A having volume $V' = pV$.



Here we follow a thermodynamic argument [12] that allows measuring resource (or information) costs in physical systems in units of energy. The generality of the argument relies on the fact that ultimately any real agent has to be incarnated in a physical system, as any process of information processing must always be accompanied by a pertinent physical process [13]. In the following we conceive of information processing as changes in information states (i.e. ultimately changes of probability distributions), which consequently imply changes in physical states, such as flipping gates in a transistor, changing the voltage on a microchip, or even changing the location of a gas particle. Such state changes in physical systems do not come for free; that is, they do not happen spontaneously. Consequently, if we want to steer a physical system into a desirable state, we also have to take into account that changing from the current state to the desirable state incurs a cost.

According to Landauer's principle, one can postulate a formal correspon- 
dence between one unit of information and one unit of energy [13, Mm- 



Consider representing one bit of information using one of the following logical devices: a molecule that can be located either in the top or the bottom part of a box; a coin whose face-up side can be either heads or tails; a door that can be either open or closed; a train that can be oriented facing either north or south; and so forth. Assume that all these devices are initialized in an undetermined logical state, where the first state has probability $p$ and the second probability $1 - p$. Now, imagine you want to set these devices to their first logical state. In the case of the molecule in a box, this means the following. Initially, the molecule moves around uniformly within a space confined by two pistons, as depicted in Figure 1a. Assuming that the initial volume is $V$, the molecule has to be pushed by the lower piston into the upper part of the box, which has volume $V' = pV$ (Figure 1b). From information theory, we know that the number of bits that we fix by this operation is given by $-\log p$.

To make things concrete, we assume that the device has diathermal walls and is immersed in a heat bath at constant temperature $T$. Since the walls are diathermal, the temperature inside the box is maintained at the temperature of the heat bath. We model the particle as an ideal gas. When an ideal gas is compressed under isothermal conditions from an initial volume $V$ to a final volume $V'$, the work is calculated as

\[ W = -\int_{V}^{V'} \frac{NkT}{V}\, dV = NkT \ln \frac{V}{V'}, \tag{1} \]

where $N > 0$ is the amount of substance and $k > 0$ is the Boltzmann constant. The minus sign is just a convention to denote work done by the piston rather than by the gas. If we assume $N = 1$ and make use of the fact that $V' = pV$, we get

\[ W = kT \ln \frac{V}{pV} = -kT \ln p = -\frac{kT}{\log e} \log p = -\gamma_{\mathrm{mol}} \log p, \]

where the constant $\gamma_{\mathrm{mol}} := \frac{kT}{\log e} > 0$ can be interpreted as the conversion factor between one unit of information and one unit of energy for the molecule-in-a-box device.
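As a numerical illustration of this conversion factor, the following minimal sketch evaluates the isothermal compression work for a single molecule. It assumes that information is measured in bits (base-2 logarithm), so that the conversion factor equals $kT \ln 2$; the function names and the temperature of 300 K are illustrative choices, not part of the original argument.

```python
import math

BOLTZMANN_K = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def compression_work(p, temperature):
    """Work in joules to compress a one-molecule ideal gas from V to p*V."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -BOLTZMANN_K * temperature * math.log(p)


def conversion_factor(temperature):
    """gamma_mol = kT / log2(e), i.e. joules per bit (assumes base-2 logs)."""
    return BOLTZMANN_K * temperature / math.log2(math.e)


def bits_fixed(p):
    """Number of bits fixed by forcing the device into a state of probability p."""
    return -math.log2(p)


# Illustrative values: room temperature, device initially undetermined (p = 1/2).
temperature = 300.0
work = compression_work(0.5, temperature)
work_from_bits = conversion_factor(temperature) * bits_fixed(0.5)
print(f"work = {work:.3e} J, gamma * bits = {work_from_bits:.3e} J")
```

Both routes give the same number, which is the point of the conversion factor: the work is the information content multiplied by a device-specific constant.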

How do we compute the information and work for the coin, door and train devices? The important observation is that we can model these cases as if they were molecule-in-a-box devices, with the difference that their conversion factors between units of information and units of work are different. Hence, the number of bits fixed while these devices are set to the first state is given by $-\log p$, i.e. exactly as in the case of the molecule. However, the work is given by

\[ -\gamma_{\mathrm{coin}} \log p, \quad -\gamma_{\mathrm{door}} \log p, \quad \text{and} \quad -\gamma_{\mathrm{train}} \log p \]

respectively, where $\gamma_{\mathrm{coin}}$, $\gamma_{\mathrm{door}}$ and $\gamma_{\mathrm{train}}$ are the associated conversion factors between units of information and units of work. Obviously, $\gamma_{\mathrm{mol}} < \gamma_{\mathrm{coin}} < \gamma_{\mathrm{door}} < \gamma_{\mathrm{train}}$. The point is that changes in knowledge states are costly and that these costs are proportional to the information. In the next section, we derive a general expression of information costs in physical systems that make decisions.
2 Information-Theoretic Foundations



2.1 Resource Costs 

We model any observable sequential process, such as a sequence of interactions or a sequence of computation steps, as a filtration on a measure space. To simplify our exposition, we consider only finite measure spaces. Let $(\Omega, \Sigma)$ denote a measurable space, where $\Omega$ denotes the sample space and where $\Sigma$ is a $\sigma$-algebra on $\Omega$. Let $p$ be a conditional probability measure on $(\Omega, \Sigma)$, such that for any two events $A, B \in \Sigma$, $p(A|B)$ denotes the conditional probability of $A$ given $B$, where the condition $B$ plays the role of the current information state of the process. The sequential realization of a process is modelled as a sequence of conditions $A_1, A_2, \ldots, A_t$ on the sample space $\Omega$, where each new condition $A_t$ refines the current information state $\bigcap_{\tau < t} A_\tau$ by excluding the complement $A_t^{c}$.

We further assume that a transformation of an information state from $B$ to $(A \cap B)$ entails a cost $\rho(A|B)$ that could be measured in dollars, time or any arbitrary scale of effort. Moreover, we assume that this transformation cost is decomposable; that is, if we undergo a knowledge change from $C$ to $(A \cap B \cap C)$, then we should pay the same cost as undergoing a change first from $C$ to $(B \cap C)$ and then from $(B \cap C)$ to $(A \cap B \cap C)$. Finally, the quintessential information-theoretic postulate is that conditional probabilities impose a monotonic order over transformation costs¹. We can sum up our postulates as follows:

Definition 1 (Axioms of Transformation Costs). Let $(\Omega, \Sigma)$ be a measurable space and let $p : (\Sigma \times \Sigma) \to [0, 1]$ be a conditional probability measure over $\Sigma$ (i.e. for any $A \in \Sigma$, $p(\cdot|A)$ is a probability measure over $A$). A function $\rho : (\Sigma \times \Sigma) \to \mathbb{R}^{+}$ is a transformation cost function for $p$ iff it has the following three properties for all events $A, B, C, D \in \Sigma$:

A1. real-valued: $\exists f,\ \rho(A|B) = f(p(A|B)) \in \mathbb{R}$,

A2. additive: $\rho(A \cap B|C) = \rho(A|C) + \rho(B|A \cap C)$,

A3. monotonic: $[p(A|B) > p(C|D)] \Leftrightarrow [\rho(A|B) < \rho(C|D)]$.

These three properties enforce a strict correspondence between probabilities 
and transformation costs 3 3|- 

Theorem 1 (Transformation Costs $\Leftrightarrow$ Probabilities). If $f$ is such that $\rho(A|B) = f(p(A|B))$ for every choice of the probability space $(\Omega, \Sigma, p)$, then $f$ is of the form

\[ f(\cdot) = -\frac{1}{\beta} \log(\cdot), \]

where $\beta$ is a real parameter.
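The logarithmic form can be checked against the additivity and monotonicity postulates on a toy example. The sketch below assumes a uniform measure on a twelve-element sample space and a positive value of $\beta$; the events `A`, `B`, `C` and the helper names are hypothetical and serve only as an illustration.

```python
import math


def cond_prob(a, b):
    """Conditional probability p(a | b) under a uniform measure on a finite set."""
    return len(a & b) / len(b)


def cost(prob, beta=2.0):
    """Transformation cost f(p) = -(1/beta) * log(p); beta > 0 is assumed."""
    return -math.log(prob) / beta


# Hypothetical events on the sample space {0, ..., 11}.
A = frozenset({0, 1, 2, 3})
B = frozenset({2, 3, 4, 5, 6, 7})
C = frozenset(range(10))

# Additivity: rho(A and B | C) = rho(A | C) + rho(B | A and C).
lhs = cost(cond_prob(A & B, C))
rhs = cost(cond_prob(A, C)) + cost(cond_prob(B, A & C))
print(f"rho(A n B | C) = {lhs:.6f}, rho(A | C) + rho(B | A n C) = {rhs:.6f}")

# Monotonicity: the more probable transition is the cheaper one.
cheaper, dearer = cost(cond_prob(A, C)), cost(cond_prob(A & B, C))
```

Additivity holds because conditional probabilities multiply along a chain of refinements and the logarithm turns that product into a sum.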

^This intuition is central for optimal coding theory where short codewords are assigned 
to frequent events and long codewords are assigned to rare events [T^ . Therefore, we could 
regard the codeword length as a valuable resource that we have to bet on events with different 
probabilities. 
That is, the transformation cost $\rho(A|B)$ is proportional to the information content $-\log p(A|B)$, where the parameter $\beta$ determines the conversion factor. The logarithmic mapping between probabilities and "costs" is well-known in information theory, and there are many possible ways to derive it
Mmi. The important observation is that our derivation stems purely from 



postulates regarding transformation costs. 

According to Definition 1, transformation costs measure the cost of an event relative to a reference event. However, we can also introduce an absolute cost measure for single events such that transformation costs are obtained as differences.

Definition 2 (Potential). Let $\rho$ be a transformation cost function. A set function $\phi : \Sigma \to \mathbb{R}$ is called a cost potential for $\rho$ iff for all $A, B \in \Sigma$,

\[ \phi(A \cap B) := \phi(B) + \rho(A|B), \]

where $\phi(\Omega)$ is an arbitrary real value.

One can easily verify that this potential is well defined for all events, and that $\rho(A|B) = \phi(A \cap B) - \phi(B)$. It captures the intuition that starting out from the high-probability event $B$ with potential $\phi(B)$ one has to pay the cost $\rho(A|B)$ to arrive at the low-probability event $A \cap B$ with potential $\phi(A \cap B)$.

In the following, consider a reference set $S \in \Sigma$ having a measurable partition $X$. Cost potentials have an important recursive structure: the cost potential of an event is uniquely determined by the potentials of its constituent events. If $X$ is a measurable partition of a reference event $S \in \Sigma$, then

\[ \phi(S) = -\frac{1}{\beta} \log \sum_{x \in X} e^{-\beta \phi(x)}. \tag{2} \]

Furthermore, the probability of a member $x \in X$ of the partition relative to $S$ can be expressed as a Gibbs measure:

\[ p(x|S) = \frac{e^{-\beta \phi(x)}}{\sum_{x' \in X} e^{-\beta \phi(x')}}. \tag{3} \]

In statistical physics it is well-known that the Gibbs measure satisfies a variational principle in the free energy, which is defined as

\[ F[q] := \sum_{x \in X} q(x)\, \phi(x) + \frac{1}{\beta} \sum_{x \in X} q(x) \log q(x). \tag{4} \]

More specifically, it is well known that for any probability measure $q$ over the partition $X$ of $S$,

\[ F[q] \geq F[p] = -\frac{1}{\beta} \log \sum_{x \in X} e^{-\beta \phi(x)} = \phi(S), \tag{5} \]

where the lower bound is attained by the Gibbs measure $p(x) \propto e^{-\beta \phi(x)}$. Equations (2) to (5) constitute fundamental results that will be generalized and interpreted in the next section.
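The variational property of the Gibbs measure is easy to probe numerically. The following sketch assumes a positive $\beta$ and an arbitrary, made-up potential over four outcomes; it compares the free energy at the Gibbs measure with the free energy at randomly drawn distributions. All names are illustrative.

```python
import math
import random


def gibbs(phi, beta):
    """Gibbs measure p(x) proportional to exp(-beta * phi(x)), plus its normalizer."""
    weights = [math.exp(-beta * value) for value in phi]
    z = sum(weights)
    return [w / z for w in weights], z


def free_energy(q, phi, beta):
    """F[q] = sum_x q(x) phi(x) + (1/beta) sum_x q(x) log q(x)."""
    energy = sum(qi * value for qi, value in zip(q, phi))
    neg_entropy = sum(qi * math.log(qi) for qi in q if qi > 0.0)
    return energy + neg_entropy / beta


phi = [0.5, 1.0, 2.0, 3.5]   # hypothetical cost potential over a partition
beta = 1.5                   # assumed positive
p, z = gibbs(phi, beta)
f_min = free_energy(p, phi, beta)
phi_S = -math.log(z) / beta  # potential of the reference event

# Free energy gap F[q] - F[p] for randomly drawn distributions q.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    raw = [rng.random() + 1e-12 for _ in phi]
    q = [r / sum(raw) for r in raw]
    gaps.append(free_energy(q, phi, beta) - f_min)
print(f"F[p] = {f_min:.6f}, phi(S) = {phi_S:.6f}, smallest gap = {min(gaps):.3e}")
```

The gap is the relative entropy between `q` and `p` divided by `beta`, so it is non-negative whenever `beta` is positive.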
2.2 Gains and Losses



Equipped with the results from the preceding section, we can now proceed to 
model a bounded rational decision maker. Because transformation costs matter, 
we model a decision as a transformation of a prior behavior into a final behavior, 
where we represent the direction of change as a utility criterion. 

The Gibbs measure in (3) allows us to describe a probability measure $p$ over a partition $X$ in terms of a cost potential $\phi$ over $X$. In particular, we see that a decision-maker's a priori behavior or belief described by $p_0(x)$ and $\phi_0(x)$ changes to $p(x)$ and $\phi(x)$ if he is exposed to the gain (or loss) $U(x)$, such that

\[ \phi(x) = \phi_0(x) - U(x) \tag{6} \]

and

\[ p(x) \propto e^{-\beta \phi_0(x) + \beta U(x)} \propto p_0(x)\, e^{\beta U(x)}, \tag{7} \]

as illustrated in Figure 2. The function $U$ represents either gains or losses and not absolute levels of costs, because it expresses a difference in the potential, $U(x) = \phi_0(x) - \phi(x)$. The equilibrium distribution (7) that arises in a change can also be characterized in terms of a variational principle, in a manner analogous to (5).

Theorem 2 (Negative Free Energy Difference). Let $p_0(x)$ and $p(x)$ be the Gibbs measures with potentials $\phi_0(x)$ and $\phi(x)$ and resource parameter $\beta$. Let $F_0$ and $F$ be the free energies minimized by $p_0$ and $p$ respectively. Then, the negative free energy difference $-\Delta F = F_0 - F$ is

\[ -\Delta F = \sum_{x} p(x)\, U(x) - \frac{1}{\beta} \sum_{x} p(x) \log \frac{p(x)}{p_0(x)}, \tag{8} \]

where $U(x) = \phi_0(x) - \phi(x)$.

Since the negative free energy difference $-\Delta F = F_0 - F$ has the same dependency on $p$ as the free energy $F$, we can use $-\Delta F$ directly as a variational principle in $p$.

Corollary 3 (Variational Principle). The negative free energy difference provides a variational principle for the equilibrium distribution, i.e.

\[ -\Delta F[q] := \sum_{x} q(x)\, U(x) - \frac{1}{\beta} \sum_{x} q(x) \log \frac{q(x)}{p_0(x)} \]

is maximized by

\[ p(x) = \frac{1}{Z}\, p_0(x)\, e^{\beta U(x)}, \quad \text{where} \quad Z := \sum_{x} p_0(x)\, e^{\beta U(x)}. \]

Furthermore,

\[ -\Delta F[q] \leq -\Delta F[p] = \frac{1}{\beta} \log Z. \]
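A small numerical sketch may help to see the corollary at work. It assumes a positive resource parameter and a made-up prior and utility over three options; the function names are illustrative and not taken from the paper. The code computes the equilibrium distribution proportional to $p_0(x) e^{\beta U(x)}$ and evaluates the negative free energy difference both there and at the prior.

```python
import math


def bounded_rational_choice(prior, utility, beta):
    """Equilibrium distribution p(x) = p0(x) exp(beta U(x)) / Z, and Z itself."""
    weights = [p0 * math.exp(beta * u) for p0, u in zip(prior, utility)]
    z = sum(weights)
    return [w / z for w in weights], z


def neg_free_energy_difference(q, prior, utility, beta):
    """Expected utility minus (1/beta) times the relative entropy of q to the prior."""
    expected_utility = sum(qi * u for qi, u in zip(q, utility))
    kl = sum(qi * math.log(qi / p0) for qi, p0 in zip(q, prior) if qi > 0.0)
    return expected_utility - kl / beta


prior = [0.25, 0.25, 0.5]    # hypothetical a priori choice probabilities
utility = [1.0, 2.0, 0.0]    # hypothetical gains
beta = 2.0                   # assumed positive resource parameter

p, z = bounded_rational_choice(prior, utility, beta)
value_at_p = neg_free_energy_difference(p, prior, utility, beta)
value_at_prior = neg_free_energy_difference(prior, prior, utility, beta)
print(f"-dF[p] = {value_at_p:.6f}, (1/beta) log Z = {math.log(z) / beta:.6f}")
print(f"-dF[p0] = {value_at_prior:.6f} (plain expected utility under the prior)")
```

Staying at the prior costs no information but only yields the prior expected utility; the equilibrium distribution pays a relative-entropy cost and still ends up with a higher value.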
Figure 2: Representing a decision maker as a thermodynamic system, the behavior of the decision-maker exposed to a gain $U$ can be expressed as a change of his initial cost potential $\phi_0$ to a final cost potential $\phi$, where $\phi = \phi_0 - U$. The choice or belief probabilities of the decision-maker change according to (7) from $p_0$ to $p$.



2.3 Choice & Belief Probabilities 

The distribution (7) can be interpreted both as an action probability and as an observation probability in the context of bounded rational decision-making. In the case of actions, $p_0$ represents the a priori choice probability of the agent, which is refined to the choice probability $p$ when evaluating the imposed gain (or loss) $U$. The associated change in probability depends on the resource parameter $\beta$ and corresponds to the computation that is necessary to evaluate the gains (or losses). In the case of observations, $p_0$ represents the a priori belief of the agent given by a probabilistic model, which is then distorted due to the presence of possible gains (or losses) that are evaluated by the holder of the belief. This way, model uncertainty and risk-aversion can be parameterized by $\beta$.

For different values of $\beta$ the distribution (7) has the following limits:

\[ \lim_{\beta \to \infty} p(x) = \delta(x - x^*), \quad x^* = \arg\max_x U(x), \]
\[ \lim_{\beta \to 0} p(x) = p_0(x), \]
\[ \lim_{\beta \to -\infty} p(x) = \delta(x - x^*), \quad x^* = \arg\min_x U(x). \]

In the case of actions the three limits imply the following: The limit $\beta \to \infty$ corresponds to the perfectly rational actor that infallibly selects the action that maximizes gain $U(x)$ (or minimizes loss $-U(x)$). The limit $\beta \to 0$ is an actor without resources that simply selects his action according to his prior. The limit $\beta \to -\infty$ corresponds to an actor that is perfectly "anti-rational" and always selects the action with the worst outcome. In the case of observations the three limits correspond to an extremely optimistic observer ($\beta \to \infty$) who believes only in the best possible outcome, an extremely pessimistic observer ($\beta \to -\infty$) who anticipates only the worst, and a risk-neutral Bayesian observer ($\beta \to 0$) who simply relies on the probabilistic model $p_0$.
2.4 The Certainty Equivalent

In statistical physics [22], the free energy difference

\[ -\Delta F = \Delta E - Q \]

measures the amount of available "good energy" (work $W$) by subtracting the "bad energy" (heat $Q$) from the total energy $\Delta E = \mathbb{E}[U]$. The crucial physical intuition is that we have uncertainty about some aspects of the objects that make up the heat energy; for example, we do not know the exact trajectories of all gas particles at inverse temperature $\beta$. This uncertainty means that we do not have full control over the objects and cannot extract all the energy as work [12]. Economically speaking, the physical concept of work, and therefore also the difference in free energy, measures the certainty equivalent of a gain (or loss) that is contaminated by uncertainty. In general, we can therefore use the free energy difference to ascribe a certainty equivalent value to choice situations of the form (7). As can be seen from Corollary 3, this value is given by the log partition function, i.e. the logarithm of the normalization constant $Z$. For different values of $\beta$, the certainty equivalent takes the following limits:

\[ \lim_{\beta \to \infty} \frac{1}{\beta} \log Z = \max_x U(x), \]
\[ \lim_{\beta \to 0} \frac{1}{\beta} \log Z = \sum_x p_0(x)\, U(x), \]
\[ \lim_{\beta \to -\infty} \frac{1}{\beta} \log Z = \min_x U(x). \]

Again, the case $\beta \to \infty$ corresponds to the perfectly rational actor (or the extremely optimistic observer), the case $\beta \to -\infty$ corresponds to the perfectly "anti-rational" actor (or the extremely pessimistic observer) and the case $\beta \to 0$ corresponds to the actor that has no resources (or the risk-neutral observer), such that the best one can expect is the expected gain or loss.
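The three limits of the certainty equivalent can be observed numerically. The sketch below evaluates $\frac{1}{\beta} \log \sum_x p_0(x) e^{\beta U(x)}$ with a shifted log-sum-exp to avoid overflow for large $|\beta|$; the prior, the utilities and the function name are made-up illustrations, and the prior is assumed strictly positive.

```python
import math


def certainty_equivalent(prior, utility, beta):
    """(1/beta) * log sum_x p0(x) exp(beta U(x)), evaluated stably."""
    if beta == 0.0:
        # Limit beta -> 0: the plain expected utility under the prior.
        return sum(p0 * u for p0, u in zip(prior, utility))
    exponents = [math.log(p0) + beta * u for p0, u in zip(prior, utility)]
    shift = max(exponents)
    log_z = shift + math.log(sum(math.exp(e - shift) for e in exponents))
    return log_z / beta


prior = [0.25, 0.25, 0.5]    # hypothetical, strictly positive
utility = [1.0, 2.0, 0.0]    # hypothetical gains

for beta in (-1000.0, -1.0, 1e-6, 1.0, 1000.0):
    value = certainty_equivalent(prior, utility, beta)
    print(f"beta = {beta:>9}: certainty equivalent = {value:.4f}")
```

The value increases monotonically with `beta`, moving from the worst outcome through the prior expectation to the best outcome.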

Corollary [3] has two interpretations in statistical physics, either as an in- 
stantiation of a minimum energy principle or as a maximum entropy principle 
(2^ . Accordingly, ([7]) can either be seen as the distribution that maximizes the 
entropy given a constraint on the expectation value of $U$ or as the distribution that minimizes the expectation of $-U$ given a constraint on the entropy of $p$. In the context of observer modeling, the first interpretation provides a principle for estimation, and the second interpretation provides a principle for bounded rational decision-making in the case of acting, which is a maximum expected gain principle with a relative entropy constraint that bounds the information-processing capacity of the decision-maker. In the relative entropy we recognize the term $\log p(x)$ as our transformation costs $\rho$ from Theorem 1, such that we can express the negative free energy difference $-\Delta F$ as

\[ -\Delta F = \mathbb{E}[U] - \mathbb{E}[R], \]

where $U(x) = \phi_0(x) - \phi(x)$ represents gains (or losses) and $R(x) = \frac{1}{\beta} \log \frac{p(x)}{p_0(x)}$, the difference between the transformation costs $\rho_0(x) = -\frac{1}{\beta} \log p_0(x)$ and $\rho(x) = -\frac{1}{\beta} \log p(x)$, represents the extra resource costs required to achieve the gain (or loss) $U$.
We can therefore see how the variational principle of Corollary 3 formalizes a trade-off between expected gains (or losses) and information processing costs.

3 Summary of Main Concepts 

In decision theory, choices between alternatives are usually formalized as choices between lotteries, where a lottery is formalized as a set $X$ of possible outcomes, a probability distribution $p_0$ over $X$, and a real-valued function $U$ over $X$ called the utility function. In particular, expected utility theory predicts that a decision-maker always chooses the lottery with the higher expected utility $\mathbb{E}[U] = \sum_x p_0(x)\, U(x)$. Here we introduce the notion of a bounded lottery as a lottery that is additionally characterized by a resource parameter $\beta \in \mathbb{R}$ that captures the resource constraints of the decision-maker.

We have derived a thermodynamic framework for bounded lotteries from 
simple axioms that measure information processing cost — see also [l^. The 
most important difference of bounded decision-making compared to perfectly 
rational decision-making is that the bounded decision-maker will not be able 
to choose infallibly the best lottery. In fact, the resource constraints lead to 
stochastic choice behavior which can be characterized by a probability distribu- 
tion. The decision process then transforms an initial choice probability po into 
a final choice probability p by taking into account the utility gains (or losses) 
and the transformation costs. This transformation process can be formalized as 

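\[ p(x) = \frac{1}{Z}\, p_0(x)\, e^{\beta U(x)}, \quad \text{where} \quad Z = \sum_{x'} p_0(x')\, e^{\beta U(x')}. \tag{9} \]

Accordingly, the choice pattern of the decision-maker is predicted by the probability $p$. Crucially, the probability $p$ extremizes the variational principle

\[ \max_{\tilde{p}} \left\{ \sum_x \tilde{p}(x)\, U(x) - \frac{1}{\beta} \sum_x \tilde{p}(x) \log \frac{\tilde{p}(x)}{p_0(x)} \right\}. \tag{10} \]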

These two terms can be interpreted as determinants of bounded rational decision-making in that they formalize a trade-off between an expected utility gain (first term) and the information processing cost of transforming $p_0$ into $p$ (second term). The certainty equivalent value of a bounded lottery can be obtained by inserting the choice probability $p$ from (9) into (10), yielding

\[ V = \frac{1}{\beta} \log \left( \sum_x p_0(x)\, e^{\beta U(x)} \right), \tag{11} \]

which corresponds to the log partition sum. For different values of $\beta$, the certainty equivalent takes the following limits:

\[ \lim_{\beta \to \infty} \frac{1}{\beta} \log Z = \max_x U(x), \]
\[ \lim_{\beta \to 0} \frac{1}{\beta} \log Z = \sum_x p_0(x)\, U(x), \]
\[ \lim_{\beta \to -\infty} \frac{1}{\beta} \log Z = \min_x U(x). \]

The case $\beta \to \infty$ corresponds to the perfectly rational actor (or the extremely optimistic observer), the case $\beta \to -\infty$ corresponds to the perfectly "anti-rational" actor (or the extremely pessimistic observer) and the case $\beta \to 0$ corresponds to the actor that has no resources (or the risk-neutral observer), such that the best one can expect is the expected gain or loss. For illustration see Figure 3.

Figure 3: a) Negative free energy difference $-\Delta F$ versus the resource parameter $\beta$. The resource parameter allows modeling decision-makers with bounded resources, either when generating their own actions ($\beta > 0$) or when anticipating their environment ($\beta < 0$). The negative free energy difference corresponds to the certainty equivalent. b) Distribution over the outcomes depending on the resource parameter $\beta$. For large positive $\beta$ the distribution concentrates on the outcome with maximum gain $U_{\max}$. For large negative $\beta$ the distribution concentrates on the worst outcome with gain $U_{\min}$. For $\beta = 0$ the outcomes follow the given distribution $p_0$.
4 Bounded Rationality and Satisficing



Herbert Simon [23] proposed in the 1950s that bounded rational decision-makers do not commit to an unlimited optimization by searching for the absolute best option. Rather, they follow a strategy of satisficing, i.e. they settle for an option that is good enough in some sense. Since then, it has been debated whether satisficing decision-makers can be described as bounded rational decision-makers that act optimally under resource constraints or whether optimization is the wrong concept altogether [1]. If decision-makers did indeed explicitly attempt to solve such a constrained optimization problem, this would lead to an infinite regress and the paradoxical situation that a bounded rational decision-maker would have to solve a more complex (i.e. constrained) optimization problem than a perfectly rational decision-maker.

To resolve this paradox, the bounded rational decision maker must not be 
able to reason about his constraints. He just searches randomly for the best 
option, until his resources run out. An observer will then be able to assign a 
probability distribution to the decision-maker's choices and investigate how this 
probability distribution changes depending on the available resources. Consider, 
for example, an anytime algorithm that will compute a solution more and more 
precisely the more time it has at its disposal. As one does not want to wait 
forever for an answer, the anytime computation will be interrupted at some 
point where one assumes that the answer is going to be good enough. This 
concept of satisficing can be used to interpret Equation (7), which describes the
choice rule of a bounded rational decision-maker. 

Consider the problem of picking the largest number in a sequence $U_0, U_1, U_2, \ldots$ of i.i.d. data, where each $U_i \in \mathcal{U}$ is drawn from a source with probability distribution $\mu$. This could be, for instance, an urn with numbered balls from which we draw with replacement while always keeping track of the largest number seen so far. After $m$ draws the largest number will be given by

\[ v := \max\{U_1, U_2, \ldots, U_m\}. \]

Naturally, the larger the number of draws, the higher the chances of observing a large number. The cumulative distribution function of choosing $v$ after $m$ draws is given by

\[ F_m(v) = F_0(v)^m, \tag{12} \]



where $F_0$ is the cumulative distribution function of $\mu$ [24]. If we only cared about finding the maximum with absolute certainty, then we would need to draw an infinite number of times. However, a bounded rational decision-maker would stop after a certain time, when he feels that the benefit of further exploration does not justify the effort of further drawings. Thus, the number of draws in this example can be regarded as a resource and the numbers on the balls can be regarded as utilities. The behavior of the bounded rational decision-maker is then stochastic even though he acts perfectly deterministically, in the sense that he chooses option $v$ with the probability given by (12) under the resource constraint $m$. According to (12), the more resources a decision-maker spends, the more he resembles a perfectly rational decision-maker that chooses the maximum number (Figure 4a), since the expected utility increases monotonically with the amount of resources spent (Figure 4b). Importantly, however, note that the marginal increase in the expected utility diminishes with larger effort; hence larger and larger effort pays out less and less in the end. Below we formalize this trade-off.

Figure 4: a) Distributions over the maximum for various sample sizes $(M + 1)$. The distribution $\mu$ over the ten values $v$ in $\mathcal{U} = \{1, 2, 3, \ldots, 10\}$ follows a truncated Poisson distribution with parameter $\lambda = 5$, as can be seen in the plot for $M = 0$. The distribution approaches a delta function over $v = 10$ for increasing values of $M$. b) The expected maximum $v$ versus sample size $(M + 1)$. The incremental gain of the expected maximum is marginally decreasing as the sample size increases (red). If the sampling process is associated with a cost, e.g. $c = 0.02$ per sample in the figure, then the penalized expected maximum (black) reaches a unique maximum for a finite sample size; the optimal sample size is $M = 35$ in the figure.
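The urn example can be explored with a short computation. The sketch below assumes the discrete setting described for the figure (values 1 to 10, truncated Poisson source with parameter 5, cost 0.02 per draw) and uses the fact that the maximum of $m$ draws has cumulative distribution $F_0(v)^m$. How exactly the figure counts samples and applies the penalty is not specified in the text, so the penalty convention here (cost times number of draws) is an assumption, and the optimum found by the code need not coincide with the value quoted for the figure. All function names are illustrative.

```python
import math


def truncated_poisson(values, lam):
    """Poisson weights lam^v / v! restricted to `values` and renormalized."""
    weights = [lam ** v / math.factorial(v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def max_distribution(mu, m):
    """P(max of m i.i.d. draws = v_i) = F0(v_i)^m - F0(v_{i-1})^m."""
    cdf, running = [], 0.0
    for prob in mu:
        running += prob
        cdf.append(running)
    cdf[-1] = 1.0  # guard against rounding in the last cumulative value
    powered = [c ** m for c in cdf]
    return [powered[0]] + [powered[i] - powered[i - 1] for i in range(1, len(powered))]


def expected_maximum(values, mu, m):
    """Expected value of the largest of m draws."""
    return sum(v * p for v, p in zip(values, max_distribution(mu, m)))


values = list(range(1, 11))
mu = truncated_poisson(values, lam=5.0)
cost_per_draw = 0.02  # assumed penalty convention: cost times number of draws

penalized = {m: expected_maximum(values, mu, m) - cost_per_draw * m
             for m in range(1, 201)}
best_m = max(penalized, key=penalized.get)
print(f"penalized expected maximum peaks at m = {best_m} draws")
```

The expected maximum keeps growing with the number of draws, but each additional draw adds less than the previous one, so a fixed per-draw cost eventually outweighs the gain and the penalized value peaks at a finite number of draws.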

Here we show that the boundedness parameter $\beta$ plays a role analogous to the number of draws $m$. In the limit of a continuous cumulative distribution function $F_0$, the density after $m$ draws is given by $p_m(v) = \frac{d}{dv} F_0(v)^m = m\, \mu(v)\, F_0(v)^{m-1}$. We can now compute the log odds for two random outcomes $v$ and $v'$, which results in

\[ \log \frac{p_m(v)}{p_m(v')} = (m - 1) \log \frac{F_0(v)}{F_0(v')} + \log \frac{\mu(v)}{\mu(v')}, \]

where $F_0(v)$ is again the cumulative distribution function of $\mu$. If we require the probabilities $p_m(v)$ to be representable by a distribution of the exponential family such that $p_m(v) \propto \mu(v)\, e^{\alpha U(v)}$, we see that the log odds have the following relation:

\[ \log \frac{p_m(v)}{p_m(v')} = \alpha \left( U(v) - U(v') \right) + \log \frac{\mu(v)}{\mu(v')}. \]

We see that $\alpha$ and $m$ play the role of the number of samples or computations. In general, the following theorem can be shown to hold.
Theorem 4. Let $X$ be a finite set. Let $Q$ and $M$ be strictly positive probability distributions over $X$. Let $\alpha$ be a positive integer. Define $M_\alpha$ as the probability distribution over the maximum of $\alpha$ samples from $M$. Then, there are strictly positive constants $\delta$ and $\xi$ depending only on $M$ such that for all $\alpha$,



Ma{x) 



< e 



Consequently, one can interpret the inverse temperature as a resource parameter that determines how many samples are drawn to estimate the maximum. Note that the distribution $M$ is arbitrary as long as it has the same support as $Q$. This interpretation can be extended to a negative $\alpha$ by noting that $\alpha U(x) = (-\alpha)(-U(x))$, i.e. instead of the maximum we take the minimum of $-\alpha$ samples.



5 Sequential Decision-Making 

In the case of sequential decision-making the assumption of uniform temperatures has to be relaxed; the proofs of the following theorems can be found in [25]. In general, we can then dedicate different amounts of computational resources to each node of a decision tree. However, this requires a translation between a tree with a single temperature and a tree with different temperatures. This translation can be achieved using the following theorem.

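Theorem 5. Let $P$ be the equilibrium distribution for a given inverse temperature $\alpha$, utility function $U$ and reference distribution $Q$. If the inverse temperature changes to $\beta$ while keeping $P$ and $Q$ fixed, then the utility function changes to

\[ V(x) = U(x) - \left( \frac{1}{\alpha} - \frac{1}{\beta} \right) \log \frac{P(x)}{Q(x)}. \]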
If we now define the reward as the change in utility between two subsequent nodes, then the rewards of the resulting decision tree are given by

\[ R(x_t|x_{<t}) := \left[ V(x_{\le t}) - V(x_{<t}) \right] = \left[ U(x_{\le t}) - U(x_{<t}) \right] - \left( \frac{1}{\alpha} - \frac{1}{\beta(x_{<t})} \right) \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})}. \]

This allows introducing a collection of node-specific (not necessarily time-specific) inverse temperatures $\beta(x_{<t})$, allowing for a greater degree of flexibility in the representation of information costs. The next theorem states the connection between the free energy and the general decision tree formulation.

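Theorem 6. The free energy of the whole trajectory can be rewritten in terms of rewards:

\[
\begin{aligned}
F_\alpha[P] &= \sum_{x_{\le T}} P(x_{\le T}) \left\{ U(x_{\le T}) - \frac{1}{\alpha} \log \frac{P(x_{\le T})}{Q(x_{\le T})} \right\} \\
&= U(\epsilon) + \sum_{x_{\le T}} P(x_{\le T}) \sum_{t=1}^{T} \left\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{Q(x_t|x_{<t})} \right\}.
\end{aligned}
\tag{13}
\]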



This translation allows applying the free energy principle to each node with a different resource parameter $\beta(x_{<t})$. By writing out the sum in (13), one realizes that this free energy has a nested structure where the latest time step forms the innermost variational problem and all other variational problems of the previous time steps can be solved recursively by working backwards in time. This then leads to the following solution:

Theorem 7. The solution to the free energy in terms of rewards is given by

\[ P(x_t|x_{<t}) = \frac{1}{Z(x_{<t})}\, Q(x_t|x_{<t}) \exp\left\{ \beta(x_{<t}) \left[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\le t})} \log Z(x_{\le t}) \right] \right\}, \]

where $Z(x_{\le T}) = 1$ and where for all $t \le T$

\[ Z(x_{<t}) = \sum_{x_t} Q(x_t|x_{<t}) \exp\left\{ \beta(x_{<t}) \left[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\le t})} \log Z(x_{\le t}) \right] \right\}. \]
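The backward recursion can be sketched on a small tree. The code below assumes that every leaf carries $Z = 1$ (so its value $\frac{1}{\beta} \log Z$ is zero) and that each inner node stores its own inverse temperature together with a list of (prior, reward, child) triples. The `Node` class, the example tree and its numbers are hypothetical illustrations, not the authors' implementation.

```python
import math


class Node:
    """Decision-tree node: inverse temperature plus (prior, reward, child) branches."""

    def __init__(self, beta=None, children=None):
        self.beta = beta                # None for leaves
        self.children = children or []  # list of (Q, R, child Node)


def solve(node):
    """Return (value, policy) with value = (1/beta) log Z at this node."""
    if not node.children:
        return 0.0, []  # leaf: Z = 1, hence value 0
    exponents = []
    for prior, reward, child in node.children:
        child_value, _ = solve(child)
        exponents.append((prior, node.beta * (reward + child_value)))
    z = sum(prior * math.exp(e) for prior, e in exponents)
    policy = [prior * math.exp(e) / z for prior, e in exponents]
    return math.log(z) / node.beta, policy


def example_tree(beta_root, beta_inner):
    """Two-step binary tree with made-up priors and rewards."""
    leaf = Node()
    left = Node(beta_inner, [(0.5, 1.0, leaf), (0.5, 0.0, leaf)])
    right = Node(beta_inner, [(0.5, 3.0, leaf), (0.5, -1.0, leaf)])
    return Node(beta_root, [(0.5, 0.0, left), (0.5, 0.5, right)])


uniform_value, uniform_policy = solve(example_tree(1.0, 1.0))
mixed_value, mixed_policy = solve(example_tree(1.0, 5.0))
print("root policy, uniform beta:", [round(p, 4) for p in uniform_policy])
print("root policy, sharper inner nodes:", [round(p, 4) for p in mixed_policy])
```

With a single inverse temperature the recursion reproduces a Gibbs distribution over whole trajectories with the summed rewards as utility. Giving the inner nodes a larger inverse temperature raises the value of both subtrees, and in this example it shifts the root policy further towards the right branch.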

6 Limit Cases of Bounded Rational Control 

As described in the previous section, the belief and action probabilities of an agent in a sequential decision-making setup can be determined by recursion of the log-partition function

\[ V(x_{<t}) = \frac{1}{\beta(x_{<t})} \log \left\{ \sum_{x_t} Q(x_t|x_{<t}) \exp\left\{ \beta(x_{<t}) \left[ R(x_t|x_{<t}) + V(x_{\le t}) \right] \right\} \right\}, \tag{14} \]

where we have introduced $V(x_{<t}) := \frac{1}{\beta(x_{<t})} \log Z(x_{<t})$. If $x_t$ is an action variable, then $Q(x_t|x_{<t})$ reflects the prior policy and the agent's rationality $\beta(x_{<t})$ determines to what extent the value $R(x_t|x_{<t}) + V(x_{\le t})$ can be optimized by the agent. If $x_t$ is an observation variable, then $Q(x_t|x_{<t})$ reflects the agent's prior belief and the rationality of the environment $\beta(x_{<t})$ indicates how much one should deviate from the prior belief considering the possible values $R(x_t|x_{<t}) + V(x_{\le t})$. Depending on $\beta(x_{<t})$, different decision-making schemes can be recovered; compare Figure 5.

Figure 5: Schematic illustration of how resource parameters can model a range of decision-making schemes: (1) risk-seeking, bounded rational; (2) risk-neutral, perfectly rational; (3) risk-averse, perfectly rational; and (4) robust, perfectly rational.

1. KL control. When assuming a history-independent loss function $r(x_t)$, Markov probabilities $p_0(x_t|x_{t-1})$ and $\beta(x_{<t}) = \beta$ for all $x_{<t}$, Equation (14) simplifies to a recursion that is equivalent to z-iteration, which has previously been suggested in [26, 27] to approximately solve MDPs by means of linear algebra; see [28, 29] for details of this equivalence relation. In [26, 27] the transition probabilities of the MDP are controlled directly and the control costs are given by the Kullback-Leibler divergence of the manipulated state transition probabilities with respect to a baseline distribution that describes the passive dynamics. In our framework, this kind of KL control corresponds to the special case where all random variables are action variables and the agent has boundedness parameter $\beta$. The stochasticity in this case, however, is not thought to arise from environmental passive dynamics, but rather is a direct consequence of bounded rational control in a (possibly) deterministic environment. The continuous case of KL control relies on the formalism of path integrals [30, 31], but essentially the same relation to bounded rationality can be established;
see [i^l for details. 

2. Optimal stochastic control. When assuming $\beta(x_{<t}) \to \infty$ for all action
variables and $\beta(x_{<t}) \to 0$ for all observation variables, we approach the
limit of the perfectly rational decision-maker in a stochastic environment.
In this limit, the log-partition function converges to the expected utility
and the decision-maker acts deterministically so as to maximize the expected
utility. For action variables, recursion (14) becomes the well-known
Bellman Optimality Equation [32]; see [28, 29] for details.

3. Risk-sensitive control. Risk-sensitive control [33] corresponds to a
decision-maker with $\beta(x_{<t}) \to \infty$ for all action variables and
$\beta(x_{<t}) \neq 0$ for observation variables. Risk-sensitivity in the context of
continuous KL control has been previously proposed in [34]. Mean-variance
decision criteria used in finance can be equally derived [35]. Risk-sensitive
decision-makers do not simply maximize the expectation of the utility,
but also consider higher-order cumulants by optimizing a stress function
given by the log-partition sum. A risk-averse decision-maker ($\beta(x_{<t}) < 0$),
for example, discounts variability from the expected utility. In contrast,
risk-seeking decision-makers ($\beta(x_{<t}) > 0$) add value to the expected
utility in the face of variability. Risk-sensitivity biases the beliefs about the
environment optimistically (collaborative environment) or pessimistically
(adversarial environment). Alternatively, one could also regard a collaborative
environment as a bounded rational controller that can choose
its own observation; that is, the environment behaves like an extension of
the agent with partial control. Importantly, the stress function is typically
assumed in risk-sensitive control schemes in the literature, whereas here
it falls out naturally; see [2^ for more details.

4. Robust control. When assuming $\beta(x_{<t}) \to \infty$ for all action
variables and $\beta(x_{<t}) \to -\infty$ for all observation variables, we approach the
limit of the robust decision-maker in an unknown environment. When
$\beta(x_{<t}) \to -\infty$, the decision-maker makes a worst-case assumption about
the environment, namely that it is strictly adversarial and perfectly
rational. This leads to the well-known game-theoretic minimax problem.
Minimax problems have been used to reformulate robust control problems
that allow controllers to cope with model uncertainties [36, 37]. Robust
control problems have long been known to be related to risk-sensitive
control H, 3^ . Here we derived both control types from the same variational
principle; see [29] for more details.
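
The four limit cases can be checked numerically on a single node of recursion
(14). The sketch below is only an illustration: the helper name
`certainty_equivalent` and the three-outcome prior and utilities are
assumptions made here, and the case beta = 0 is handled separately as the
expected utility.

```python
import math

def certainty_equivalent(beta, prior, utility):
    """(1/beta) * log sum_x prior(x) * exp(beta * utility(x)).
    For beta == 0 the limit value, the expected utility, is returned."""
    if beta == 0.0:
        return sum(q * u for q, u in zip(prior, utility))
    scores = [beta * u for u in utility]
    shift = max(scores)                      # log-sum-exp shift for stability
    total = sum(q * math.exp(s - shift) for q, s in zip(prior, scores))
    return (shift + math.log(total)) / beta

# Hypothetical node with three outcomes.
prior = [0.25, 0.25, 0.5]
utility = [0.0, 1.0, 2.0]

expected = certainty_equivalent(0.0, prior, utility)        # risk-neutral
variance = sum(q * (u - expected) ** 2 for q, u in zip(prior, utility))

near_zero = certainty_equivalent(1e-6, prior, utility)      # beta -> 0
optimistic = certainty_equivalent(200.0, prior, utility)    # beta -> +infinity
pessimistic = certainty_equivalent(-200.0, prior, utility)  # beta -> -infinity
small_beta = certainty_equivalent(0.01, prior, utility)
# second-order expansion: mean plus (beta / 2) times the variance
mean_variance = expected + 0.5 * 0.01 * variance
```

For large positive beta the value approaches the maximum utility, for large
negative beta the minimum (the worst case), and for small beta it is
approximated by a mean-variance criterion whose sign of the variance term
follows the sign of beta.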



7 Discussion 

In the proposed thermodynamic interpretation of bounded rationality, agents 
with limited resources search for a maximum over a set by randomly drawing 
elements from this set. This random search leads to a utility function that is 
marginally decreasing when more search effort is allocated. When such agents 
pay a search cost, the bounded rational optimum is to abort the search as 
soon as the marginal returns are equal to the search cost. The resulting trade- 
off between utility maximization and resource costs can be quantified by the 
Kullback-Leibler divergence with respect to an initial policy or belief. This initial
probability distribution corresponds to the initial state of a thermodynamic
system that changes when a new potential is imposed. The difference in the po- 
tential corresponds to utility gains or losses in economic choice. The difference 
in the free energy corresponds to physical work and the economic certainty- 
equivalent. Thus, gains or losses that are associated with uncertainty are effec- 
tively devalued or overvalued, depending on the sign of the resource parameter. 
This way risk-sensitivity, robustness to model uncertainty and game-theoretic 
minimax-strategies can arise naturally. 
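
The stopping rule for random search can be illustrated with a toy calculation.
The sketch assumes, purely for illustration, that the utilities of the drawn
elements are independent and uniformly distributed on [0, 1], so that the
expected best of n draws is n / (n + 1); the function names and the constant
cost per draw are likewise hypothetical.

```python
def expected_best(n):
    """Expected maximum of n i.i.d. uniform(0, 1) utilities (0 for n = 0)."""
    return n / (n + 1)

def optimal_search_length(cost, max_draws=10_000):
    """Keep drawing while the marginal return of one more draw exceeds its cost."""
    n = 0
    while n < max_draws and expected_best(n + 1) - expected_best(n) > cost:
        n += 1
    return n

cheap_search = optimal_search_length(0.001)
moderate_search = optimal_search_length(0.01)
expensive_search = optimal_search_length(0.6)
```

The marginal return of an additional draw shrinks as 1 / ((n + 1)(n + 2)), so
a higher search cost leads to an earlier abort and a lower attained utility.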

Bounded rationality. Starting with Simon [23], bounded rationality has
been extensively studied in psychology, economics, political science, industrial 
organization, computer science and artificial intelligence research — see for ex- 
ample 0, liil, m, m, M, El, H El- Additionally, numerous experiments in 
behavioral economics have shown that humans systematically violate perfect 
rationality, that is they are bounded rational [i^. Probably the most closely 
related approach to bounded rationality with respect to the present article is
quantal response equilibrium (QRE) game theory [illiiliilH^l. QRE models 
assume bounded rational players whose choice probabilities are given by the 
Boltzmann distribution and whose rationality is determined by a temperature 
parameter. Interactions of such bounded rational players can lead to game- 
theoretic solutions that deviate from the Nash equilibrium. The QRE model 
is a special case of the model presented here where all prior probabilities are 
assumed to be uniform. These prior probabilities are crucial when defining the 
certainty-equivalent that ranges from minimum to maximum via the expected 
utility. As the certainty-equivalent corresponds to physical work, this also
allows relating bounded rational decision-making to thermodynamic processes.
The distinction of a prior policy and a utility that is optimized to some extent 
is fundamental to the notion of bounded rationality proposed in this paper and 
therefore also affords a qualitative advance over the bounded rationality model in
QRE models.



Information theory in control and game theory. As already discussed,
a number of papers have suggested the use of the relative entropy as a cost
function for control [2^ 0, [U . Previously, Saridis [HI] has framed optimal
and adaptive control as entropy minimization problems. Statistical physics has
also served as an inspiration to a number of other studies, for example, to an
information-theoretic approach to interactive learning [54], to the use of
information theory to approximate joint strategies in games with bounded
rational players [55] and to the problem of optimal search [56, 57], where the
utility losses correspond directly to search effort. Recently, Tishby and
Polani [58] have shown how to apply information theory to understand
information flow in the action-observation cycle. The contribution of our study
is to devise information-theoretic axioms to quantify search costs in bounded
optimization problems. This allows for a unified treatment of control and
game-theoretic problems, as well as estimation and learning problems for both
perfectly rational and bounded rational agents. In the future it will be
interesting to relate the thermodynamic resource costs of bounded rational
agents to more traditional notions of resource costs in computer science like
space and time requirements of algorithms (Hol .



Variational Preferences. In the economic literature the Kullback-Leibler
divergence has appeared in the context of multiplier preference models that can
deal with model uncertainty ^ . In particular, it has been proposed that a bound
on the Kullback-Leibler divergence could be used to indicate how much of a
deviation from a proposed model $p_0$ is allowed when computing robust decision
strategies that work under a range of models in the neighborhood of $p_0$. In
variational preference models [60] this is generalized to models of the form

$$
f \succeq g \iff \min_p \Big( \int u(f)\, dp + c(p) \Big) \ge \min_p \Big( \int u(g)\, dp + c(p) \Big),
$$

where $c(p)$ can be interpreted as an ambiguity index that can explain effects of
ambiguity aversion. The thermodynamic certainty-equivalent of work, computed
as the log-partition sum, also falls within this preference model. However, an
important difference is that the choice in a thermodynamic system is not
deterministic with respect to the certainty-equivalent, but stochastic following
a generalized Boltzmann distribution. Due to this stochasticity of the choice
behavior itself, the thermodynamic model can be linked to both bounded
rationality and model uncertainty, whereas variational preference models have so far
concentrated on explaining effects of ambiguity aversion and model uncertainty.
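
That the log-partition sum is an instance of such a preference model can be
verified numerically for a discrete set of outcomes. The sketch below takes
the ambiguity index to be a scaled Kullback-Leibler divergence,
c(p) = KL(p || p_0) / theta with theta > 0, which corresponds to a negative
resource parameter beta = -theta; the function names, the reference model and
the utilities are assumptions chosen for illustration.

```python
import math
import random

def objective(p, p0, u, theta):
    """Expected utility under p plus the ambiguity index KL(p || p0) / theta."""
    expected = sum(pi * ui for pi, ui in zip(p, u))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p0) if pi > 0.0)
    return expected + kl / theta

def log_partition_value(p0, u, theta):
    """Closed form of the minimum: -(1/theta) * log sum_x p0(x) exp(-theta u(x))."""
    return -math.log(sum(qi * math.exp(-theta * ui) for qi, ui in zip(p0, u))) / theta

def worst_case_model(p0, u, theta):
    """Minimizing model: p*(x) proportional to p0(x) * exp(-theta * u(x))."""
    weights = [qi * math.exp(-theta * ui) for qi, ui in zip(p0, u)]
    z = sum(weights)
    return [w / z for w in weights]

def random_model(rng, n):
    """A random probability vector with strictly positive entries."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

p0 = [0.2, 0.5, 0.3]       # hypothetical reference model
u = [1.0, 0.0, 2.0]        # hypothetical utilities of one act
theta = 1.5

p_star = worst_case_model(p0, u, theta)
attained = objective(p_star, p0, u, theta)
closed_form = log_partition_value(p0, u, theta)

rng = random.Random(0)
trials = [objective(random_model(rng, 3), p0, u, theta) for _ in range(1000)]
```

No randomly drawn model achieves a lower objective than the Boltzmann-type
model `p_star`, whose objective value coincides with the log-partition sum.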

Ellsberg's and Allais' paradox. Two of the most famous deviations from
expected utility theory that have been consistently observed in human
decision-making are the paradoxes of Ellsberg [61] and Allais [62]. While the
first paradox has encouraged a large literature dealing with model uncertainty
[37], the latter paradox has led to the development of prospect theory
[63, 64]. Ellsberg showed that human choice in the face of ambiguity differs
from decision-making under risk where precise probability models are available.
Humans typically tend to avoid ambiguous options, rather than choosing the
option with higher expected utility. The observed ambiguity aversion can be
modeled straightforwardly by a bounded rational decision-maker by allowing some
degree of minimaxing in the spirit of a risk-sensitive controller; see
Supplementary Material for details. Allais showed that humans frequently
reverse their preferences in choice tasks that may not lead to preference
reversals according to expected utility theory. These reversals typically occur
for different levels of riskiness of the same choices. The explanation of the
Allais paradox within the framework of bounded rationality is not as
straightforward as that of the Ellsberg paradox, but may involve
context-dependent changes of the boundedness parameter or biases in the
decision-making process that lead to a generalized quasi-linear mean model
[65, 66, 67, 68], which provides an alternative account of preference reversals
of the Allais type without violating the principle of stochastic dominance; see
Supplementary Material for more details.

Stochastic Choice. Stochastic choice rules have been extensively studied in
the psychological and econometric literature, in particular logit choice models
based on the Boltzmann distribution [69, 70]. The literature on Boltzmann
distributions for decision-making goes back to Luce [71], extending through
McFadden [72], Meginnis [73], Fudenberg [74] and Wolpert [H [ll 50|. Luce [71]
has studied stochastic choice rules of the form $p(x_i) = w_i / \sum_j w_j$, which
includes the Boltzmann distribution and the "softmax" rule known in the
reinforcement learning literature [76]. McFadden [72] has shown that such
distributions can arise, for example, when utilities are contaminated with
additive noise following an extreme value distribution. While stochastic choice
models are generally accepted to account for human choices better than their
deterministic counterparts [77, 78, 79], they have also been strongly
criticized, especially for a property known as independence of irrelevant
alternatives (IIA). Similar to the independence axiom in expected utility
theory, IIA implies that the ratio of two choice probabilities does not depend
on the presence of a third irrelevant alternative in the choice set. What
distinguishes the free energy equations from the above choice rules is that
stochastic choice behavior is described by a generalized exponential family
distribution of the form $p(x) \propto p_0(x) \exp(\beta U(x))$. Changing
the choice set might in general also change the prior $p_0(x)$, but more importantly
it might also change the resource parameter $\beta$.
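
The role of the resource parameter in IIA can be made concrete with a small
calculation. The following sketch is an illustration only: the function
`boltzmann_choice`, the option labels and all numbers are hypothetical. With a
fixed resource parameter the ratio of two choice probabilities is unaffected by
a third option, whereas a resource parameter that depends on the choice set
changes that ratio.

```python
import math

def boltzmann_choice(prior, utility, beta):
    """Choice probabilities p(x) proportional to prior(x) * exp(beta * utility(x))."""
    weights = {x: prior[x] * math.exp(beta * utility[x]) for x in utility}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

# Two options with a uniform prior.
pair = boltzmann_choice({"a": 0.5, "b": 0.5}, {"a": 1.0, "b": 0.5}, beta=2.0)

# A third option is added; prior stays uniform.
prior3 = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
utility3 = {"a": 1.0, "b": 0.5, "c": 0.8}
triple_same_beta = boltzmann_choice(prior3, utility3, beta=2.0)
# Assumption: the larger choice set leaves fewer resources per option.
triple_lower_beta = boltzmann_choice(prior3, utility3, beta=1.0)

ratio_pair = pair["a"] / pair["b"]
ratio_same_beta = triple_same_beta["a"] / triple_same_beta["b"]
ratio_lower_beta = triple_lower_beta["a"] / triple_lower_beta["b"]
```

Here `ratio_pair` and `ratio_same_beta` agree, as IIA demands, while
`ratio_lower_beta` is smaller because the decision-maker discriminates less
sharply between the two original options.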



Diffusion-to-bound models. Diffusion-to-bound models typically model the
process of binary decision-making as a random walk process that terminates once
it hits one of two given decision bounds [80]. Each time step of the random walk
provides noisy evidence towards one of the two options. This implements a
natural speed-accuracy trade-off: the further away the bounds, the more reliable
the decision will be, as the noise can be averaged out, but also the longer one
has to wait. The resulting choice probabilities are identical to the choice
probabilities of a bounded rational decision-maker if we relate the decision
bound of the random walk with the boundedness parameter in [7]; see
Supplementary Material for details. The boundedness parameter can then also be
shown to be proportional to the time required for the decision-making process.
Diffusion-to-bound models have been widely used in behavioral psychology and
neuroscience to explain probabilistic choice and reaction times in psychometric
experiments; see [81] for a review. Decision-makers that apply the
diffusion-to-bound model may be regarded as bounded rational decision-makers
from a normative point of view.
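
The correspondence between decision bounds and the boundedness parameter can be
illustrated with a discrete random walk. The sketch makes several assumptions
that are not taken from the Supplementary Material: steps of size one with a
fixed upward probability, symmetric bounds, a uniform prior over the two
options, and the identification of the utility gap with the log-odds of a
single step. Function names are hypothetical.

```python
import math

def upper_bound_probability(p_up, bound, sweeps=5000):
    """Probability that a +1/-1 random walk started at 0 hits +bound before
    -bound, obtained by Gauss-Seidel sweeps over the hitting probabilities."""
    h = {k: 0.0 for k in range(-bound, bound + 1)}
    h[bound] = 1.0                            # absorbed at the upper bound
    for _ in range(sweeps):
        for k in range(-bound + 1, bound):    # interior states only
            h[k] = p_up * h[k + 1] + (1.0 - p_up) * h[k - 1]
    return h[0]

def boltzmann_two_options(beta, utility_gap):
    """Uniform-prior Boltzmann probability of choosing the better of two options."""
    return 1.0 / (1.0 + math.exp(-beta * utility_gap))

p_up = 0.6
step_log_odds = math.log(p_up / (1.0 - p_up))   # evidence carried by one step

walk = {a: upper_bound_probability(p_up, a) for a in (1, 3, 5)}
boltzmann = {a: boltzmann_two_options(a, step_log_odds) for a in (1, 3, 5)}
```

Under these assumptions the bound plays the role of the resource parameter:
moving the bounds further apart makes the choice more deterministic, at the
price of a longer walk.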



Free Energy Principle. A central property of closed thermodynamic systems
is that they minimize free energy. A free energy principle based on the
variational Bayes approach has recently also been proposed as a theoretical
framework to understand brain function [82, 83]. In this framework generative
models of the form $p(y|h, a)$ explain how hidden causes $h$ in the environment and
actions $a$ produce observations $y$. The brain uses an approximate distribution
$Q(h; a)$ to determine the hidden causes. The free energy

$$
F = -\int dh\, Q(h; a) \ln P(y, h|a) + \int dh\, Q(h; a) \ln Q(h; a)
$$

measures how well the brain is doing with this approximation. According to
[82, 83], action and perception consist in choosing $a$ and $Q$ respectively so as to
minimize this free energy. In light of the thermodynamic view of free energy,
maximizing the likelihood $\ln P(y, h|a)$, or minimizing surprise, is a particular
choice of potential function $\phi$, where the boundedness consists in being
restricted to model class $Q$ instead of having full disposal of $p(y|h, a)$. More
generally, variational Bayes methods that use particular classes of distributions
to approximate the posterior could thus be regarded as a form of bounded
inference within this picture.
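
For a discrete set of hidden causes, the variational free energy (the expected
negative log joint probability minus the entropy of Q) can be evaluated
directly. The sketch below holds the action fixed and uses a hypothetical
joint distribution over three hidden causes for one observed y; the function
name is an assumption. It checks the standard decomposition of this free
energy into surprise plus the Kullback-Leibler divergence between Q and the
exact posterior.

```python
import math

def variational_free_energy(q, joint):
    """F = -sum_h q(h) ln P(y, h) + sum_h q(h) ln q(h) for one fixed y."""
    return sum(qh * (math.log(qh) - math.log(ph))
               for qh, ph in zip(q, joint) if qh > 0.0)

joint = [0.10, 0.30, 0.05]                   # hypothetical P(y, h | a), three causes h
evidence = sum(joint)                        # P(y | a)
surprise = -math.log(evidence)
posterior = [ph / evidence for ph in joint]  # exact P(h | y, a)
uniform = [1.0 / 3.0] * 3                    # a restricted, suboptimal choice of Q

f_posterior = variational_free_energy(posterior, joint)
f_uniform = variational_free_energy(uniform, joint)
kl_uniform = sum(q * math.log(q / p) for q, p in zip(uniform, posterior))
```

The free energy equals the surprise only when Q is the exact posterior; any
restriction of the model class leaves a positive gap, which is the sense in
which such inference is bounded.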

8 Conclusion 

Thermodynamics provides a framework for bounded rationality that can be 
both descriptive and prescriptive. It is descriptive in the sense that it describes 
behavior that is clearly sub-optimal from the point of view of a perfectly rational
decision-maker with infinite resources. It is prescriptive in the sense that it 
prescribes how a bounded rational actor should behave optimally given resource 
constraints formalized by $\beta$. As we have argued in this paper, bounded rational
decision-making provides an overarching principle in both senses in economics, 
engineering, artificial intelligence, psychology and neuroscience. 

A thermodynamic model of bounded rational decision-making also has two
advantages over traditional decision theory of perfect rationality. First, it allows
connecting computational processes with real physical processes, for example
how much entropy they generate and how much energy they require [13]. Second,
it suggests a notion of intelligence that is closely related to the process
of evolution. It is straightforward to see that bounded rational controllers of
the form © share their structure with Bayes' rule, where we identify the prior
$p_0(x)$, the likelihood model $e^{\beta U(x)}$ and the posterior $p(x)$, normalized by the
partition function, thus establishing a close link between inference and
control [13]. Furthermore, both bounded rational controllers and Bayes' rule share
their structural form with discrete replicator dynamics that model evolutionary
processes [84], where samples (a population) are pushed through a fitness
function (likelihood, gain function) that biases the distribution of the population,
thereby transforming a distribution $p_0$ to a new distribution $p$. In this picture
different hypotheses $x$ compete for probability mass over subsequent iterations,
favoring those $x$ that have a lower-than-average cost. Just like the evolutionary
random processes of variation and selection created intelligent organisms on
a phylogenetic timescale, similar random processes might underlie (bounded)
intelligent behavior in individuals on an ontogenetic timescale.
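
The shared structure of bounded rational control, Bayes' rule and discrete
replicator dynamics can be seen in a short simulation. This is a sketch under
assumptions made here: the function name `reweight`, the three hypotheses and
their utilities are illustrative, and the fitness function is taken to be
exp(beta * U(x)), so that a high utility plays the role of a low cost.

```python
import math

def reweight(p, utility, beta):
    """One selection step: p'(x) proportional to p(x) * exp(beta * U(x)).
    Read as Bayes' rule, p is the prior and exp(beta * U) the likelihood."""
    weights = [pi * math.exp(beta * ui) for pi, ui in zip(p, utility)]
    z = sum(weights)                     # partition function (normalizer)
    return [w / z for w in weights]

prior = [0.5, 0.3, 0.2]                  # hypothetical initial population p_0
utility = [1.0, 2.0, 0.0]
beta = 0.7

# Three generations of selection with the same fitness function ...
population = prior
for _ in range(3):
    population = reweight(population, utility, beta)

# ... match a single bounded rational update with three times the resources.
single_step = reweight(prior, utility, 3 * beta)
```

Repeated selection therefore acts like an increase of the resource parameter:
probability mass concentrates progressively on the hypothesis with the highest
utility.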